Fitted Q-iteration in continuous action-space MDPs
Authors
Abstract
We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems.
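The variant described above replaces the continuous-action greedy step with a search over a small candidate policy set, picking the policy with the highest average action value on the batch states. A minimal sketch, assuming a user-supplied regressor and candidate policies (all names are illustrative, not the paper's actual implementation):

```python
import numpy as np

def fitted_q_iteration(batch, candidate_policies, fit_regressor,
                       gamma=0.95, n_iters=20):
    """Hypothetical sketch of the fitted Q-iteration variant.

    batch: list of (s, a, r, s_next) transitions (numpy arrays).
    fit_regressor(X, y) must return a callable q(s, a).
    """
    q = lambda s, a: 0.0          # initial Q-function
    policy = candidate_policies[0]
    for _ in range(n_iters):
        # Regression targets: one-step backup r + gamma * Q(s', pi(s')).
        X = np.array([np.concatenate([s, a]) for s, a, r, s2 in batch])
        y = np.array([r + gamma * q(s2, policy(s2)) for s, a, r, s2 in batch])
        q = fit_regressor(X, y)
        # Policy step: instead of a pointwise argmax over continuous actions,
        # pick the candidate policy maximizing the AVERAGE action value.
        states = [s for s, a, r, s2 in batch]
        policy = max(candidate_policies,
                     key=lambda pi: np.mean([q(s, pi(s)) for s in states]))
    return q, policy
```

The restricted policy set is what makes the maximization tractable: evaluating a handful of candidate policies on the batch replaces a continuous optimization at every state.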
Similar Papers
Fitted Natural Actor-Critic: A New Algorithm for Continuous State-Action MDPs
In this paper we address reinforcement learning problems with continuous state-action spaces. We propose a new algorithm, fitted natural actor-critic (FNAC), that extends the work in [1] to allow for general function approximation and data reuse. We combine the natural actor-critic architecture [1] with a variant of fitted value iteration using importance sampling. The method thus obtained combines...
Accelerated decomposition techniques for large discounted Markov decision processes
Many hierarchical techniques to solve large Markov decision processes (MDPs) are based on the partition of the state space into strongly connected components (SCCs) that can be classified into some levels. In each level, smaller problems named restricted MDPs are solved, and then these partial solutions are combined to obtain the global solution. In this paper, we first propose a novel algorith...
Finite-Time Bounds for Sampling-Based Fitted Value Iteration
In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision problems (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The con...
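The generative-model setting described here admits a compact sketch: at each iteration, sample next states through the model and regress a new value function on one-step Bellman backups. This is a hedged illustration only; the function names, the finite action grid, and the pluggable regressor are assumptions, not the paper's construction:

```python
import numpy as np

def sampled_fvi(step, sample_states, actions, fit, gamma=0.95, n_iters=10):
    """Hypothetical sketch of sampling-based fitted value iteration.

    step(s, a) -> (r, s_next): the generative model of the environment.
    sample_states() -> iterable of states used as regression inputs.
    fit(S, targets) must return a callable V(s).
    """
    V = lambda s: 0.0  # initial value function
    for _ in range(n_iters):
        S = sample_states()
        targets = []
        for s in S:
            # One-step Bellman backup, maximized over a sampled action set.
            backups = [r + gamma * V(s2)
                       for (r, s2) in (step(s, a) for a in actions)]
            targets.append(max(backups))
        V = fit(np.array(S), np.array(targets))
    return V
```

With a deterministic toy model the iterates quickly order states by long-run reward, which is the qualitative behavior the finite-time bounds quantify.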
Evaluation of Batch-Mode Reinforcement Learning Methods for Solving DEC-MDPs with Changing Action Sets
DEC-MDPs with changing action sets and partially ordered transition dependencies have recently been suggested as a sub-class of general DEC-MDPs that features provably lower complexity. In this paper, we investigate the usability of a coordinated batch-mode reinforcement learning algorithm for this class of distributed problems. Our agents acquire their local policies independent of the other a...
Reinforcement Learning with a Bilinear Q Function
Many reinforcement learning methods are based on a function Q(s, a) whose value is the discounted total reward expected after performing the action a in the state s. This paper explores the implications of representing the Q function as Q(s, a) = sWa, where W is a matrix that is learned. In this representation, both s and a are real-valued vectors that may have high dimension. We show that acti...
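One attraction of the bilinear form Q(s, a) = sᵀWa is that greedy action selection over continuous actions becomes closed-form: since Q is linear in a, the maximizer over the unit ball is Wᵀs / ‖Wᵀs‖. A minimal sketch of that computation (a unit-norm action constraint is assumed here for illustration; it is not stated in the excerpt above):

```python
import numpy as np

def bilinear_q(s, W, a):
    """Q(s, a) = s^T W a, with s and a real-valued vectors."""
    return float(s @ W @ a)

def greedy_action(s, W):
    """Greedy action under the constraint ||a|| <= 1.

    Because Q(s, a) = (W^T s) . a is linear in a, the maximizer over
    the unit ball is the normalized direction W^T s / ||W^T s||.
    """
    v = W.T @ s
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```

This closed-form argmax is what makes the representation practical in high-dimensional continuous action spaces, where a numerical search per state would otherwise be required.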